The Beijing Academy of Artificial Intelligence (BAAI), in collaboration with Shanghai Jiao Tong University and other institutions, recently released Video-XL-2, a new-generation model for ultra-long video understanding. The release marks a significant advance in open-source ultra-long video understanding and gives new momentum to multimodal large models that process long video content. Architecturally, Video-XL-2 consists of three core components: a visual encoder, a Dynamic Token Synthesis (DTS) module, and a large language model (LLM). The model adopts